{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# word2vec\n",
    "---\n",
    "Word2Vec 是一种著名的 词嵌入（Word Embedding） 方法，它可以计算每个单词在其给定语料库环境下的 分布式词向量（Distributed Representation，亦直接被称为词向量）。词向量表示可以在一定程度上刻画每个单词的语义。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 简单用法\n",
    "---\n",
    "### 读取语料\n",
    "---\n",
    "* class gensim.models.word2vec.BrownCorpus（dirname ）\n",
    "从布朗语料库（NLTK数据的一部分）迭代句子,dirname是存储布朗语料库的根目录(通过nltk.download()下载布朗语料库)，得到的这个对象可以通过循环迭代语料库的句子。\n",
    "\n",
    "* class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)\n",
    "与上一样，也是产生迭代器，但需要更改下文件格式。简单的格式：一篇文档=一行; 单词已经过预处理并由空格分隔。\n",
    "\n",
    "* class gensim.models.word2vec.PathLineSentences（source，max_sentence_length = 10000，limit = None ）\n",
    "与LineSentence类一样，不过这里是处理根目录下的所有文件，同样文件中句子格式需要处理\n",
    "\n",
    "* class gensim.models.word2vec.Text8Corpus（fname，max_sentence_length = 10000 ）\n",
    "从text8语料库中迭代句子\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import word2vec\n",
    "\n",
    "file_path = 'word2vec训练数据.txt'\n",
    "# 使用LineSentence读取语料\n",
    "sentences = word2vec.LineSentence(file_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 训练word2vec语义向量\n",
    "---\n",
    "```python\n",
    "class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5,  \n",
    "                   max_vocab_size=None, sample=1e-3, seed=1, workers=3, min_alpha=0.0001,  \n",
    "                   sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,  \n",
    "                   trim_rule=None, sorted_vocab=1, batch_words=MAX_WORDS_IN_BATCH, \n",
    "                   compute_loss=False,callbacks=(), \n",
    "                   max_final_vocab=None)  \n",
    "```\n",
    "* sentence(iterable of iterables):可迭代的句子可以是简单的list，但对于较大的语料库，可以考虑直接从磁盘/网络传输句子的迭代。见BrownCorpus，Text8Corpus 或LineSentence.\n",
    "* SG(INT {1 ，0}) -定义的训练算法。如果是1，则使用skip-gram; 否则，使用CBOW。\n",
    "* hs：是否采用基于Hierarchical Softmax的模型。参数为1表示使用，0表示不使用\n",
    "* size(int) - 特征向量的维数。\n",
    "* window(int) - 句子中当前词和预测词之间的最大距离。\n",
    "* min_count(int) - 忽略总频率低于此值的所有单词。 \n",
    "\n",
    "    关于Hierarchical Softmax与negative sampling，可以参考以下博客:  \n",
    "        http://www.cnblogs.com/pinard/p/7243513.html  \n",
    "        https://www.cnblogs.com/pinard/p/7249903.html  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "model = word2vec.Word2Vec(sentences, hs=1,min_count=1,window=2,size=128)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zu/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n",
      "  \"\"\"Entry point for launching an IPython kernel.\n"
     ]
    },
    {
     "ename": "KeyError",
     "evalue": "\"word 'sfsdf' not in vocabulary\"",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-13-383985dbf5dd>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmodel\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"sfsdf\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/gensim/utils.py\u001b[0m in \u001b[0;36mnew_func1\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m   1445\u001b[0m                     \u001b[0mstacklevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m2\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1446\u001b[0m                 )\n\u001b[0;32m-> 1447\u001b[0;31m                 \u001b[0;32mreturn\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1448\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1449\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mnew_func1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/gensim/models/word2vec.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, words)\u001b[0m\n\u001b[1;32m   1119\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1120\u001b[0m         \"\"\"\n\u001b[0;32m-> 1121\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwv\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getitem__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mwords\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1122\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1123\u001b[0m     \u001b[0;34m@\u001b[0m\u001b[0mdeprecated\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Method will be removed in 4.0.0, use self.wv.__contains__() instead\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/gensim/models/keyedvectors.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, entities)\u001b[0m\n\u001b[1;32m    351\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mentities\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstring_types\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    352\u001b[0m             \u001b[0;31m# allow calls like trained_model['office'], as a shorthand for trained_model[['office']]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 353\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_vector\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mentities\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    354\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    355\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0mvstack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_vector\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mentity\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mentity\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mentities\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/gensim/models/keyedvectors.py\u001b[0m in \u001b[0;36mget_vector\u001b[0;34m(self, word)\u001b[0m\n\u001b[1;32m    469\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    470\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mget_vector\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 471\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mword_vec\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mword\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    472\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    473\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mwords_closer_than\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mw1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mw2\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/lib/python3.7/site-packages/gensim/models/keyedvectors.py\u001b[0m in \u001b[0;36mword_vec\u001b[0;34m(self, word, use_norm)\u001b[0m\n\u001b[1;32m    466\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    467\u001b[0m         \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 468\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"word '%s' not in vocabulary\"\u001b[0m \u001b[0;34m%\u001b[0m \u001b[0mword\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    469\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    470\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mget_vector\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mword\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: \"word 'sfsdf' not in vocabulary\""
     ]
    }
   ],
   "source": [
    "model[\"sfsdf\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 保存模型\n",
    "---\n",
    "model.save(file_name)\n",
    "* file_name:存储模型的名称"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.save('model_word2vec_test')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 加载模型\n",
    "---\n",
    "word2vec.Word2Vec.load(file_name)\n",
    "* file_name:存储的模型的名称"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = word2vec.Word2Vec.load('model_word2vec_test')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zu/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n",
      "  \n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array([-0.05317169, -0.15755793, -0.06250006, -0.01180194,  0.04724145,\n",
       "       -0.26425636, -0.01873707, -0.17131071,  0.03399659, -0.29721585,\n",
       "       -0.23619682, -0.00187487, -0.03608605, -0.23502819,  0.1677451 ,\n",
       "        0.03139535, -0.04964776, -0.20321703,  0.03333145, -0.26338065,\n",
       "       -0.32002532,  0.2947625 ,  0.18000598, -0.03674206,  0.02373896,\n",
       "        0.07094915,  0.11757199, -0.22713898, -0.3052054 , -0.16245638,\n",
       "        0.23368882,  0.11960867, -0.08898318,  0.06621196,  0.1065318 ,\n",
       "       -0.0490705 , -0.16302302, -0.01123774, -0.11983654, -0.02068417,\n",
       "        0.2732088 ,  0.2906885 , -0.13971347,  0.07436862, -0.21679416,\n",
       "       -0.13932157, -0.06862234,  0.06071692,  0.04861921, -0.00767402,\n",
       "       -0.00748343, -0.12588996,  0.2271392 ,  0.1355575 , -0.04126433,\n",
       "        0.02798286,  0.11357804,  0.07826936,  0.1436871 ,  0.09072269,\n",
       "       -0.27041167, -0.13087061, -0.03427159, -0.34049234, -0.05541604,\n",
       "       -0.09847128, -0.15053135, -0.11456229, -0.19661945,  0.15286963,\n",
       "        0.02494534, -0.25840554, -0.21153094, -0.08175035,  0.18903485,\n",
       "        0.08281357,  0.11453687, -0.08251194, -0.46239945, -0.01451807,\n",
       "       -0.14279196,  0.41197246, -0.3463378 ,  0.5971489 ,  0.14728662,\n",
       "       -0.18099079,  0.06169752, -0.09317809, -0.30950165, -0.14726709,\n",
       "       -0.06563608,  0.4886801 , -0.15419646,  0.15854667, -0.29109967,\n",
       "       -0.02661166,  0.05160856,  0.3174907 , -0.07806236, -0.10281344,\n",
       "        0.34013486,  0.05017262,  0.24028341, -0.18009193, -0.1048708 ,\n",
       "       -0.25016692,  0.0745217 , -0.10217225,  0.11373767,  0.21968596,\n",
       "       -0.11322594, -0.0408846 ,  0.17337096,  0.25807914,  0.00826386,\n",
       "        0.17644595,  0.23506783,  0.22146137,  0.03455357,  0.18263113,\n",
       "        0.28811708, -0.09023061,  0.11266516, -0.17953789,  0.06702979,\n",
       "        0.0907107 , -0.01186496,  0.03114617], dtype=float32)"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 获取单词word2vec值\n",
    "model['apple']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.662871\n",
      "0.9732659\n",
      "0.97379863\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/zu/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).\n",
      "  \n",
      "/Users/zu/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).\n",
      "  This is separate from the ipykernel package so we can avoid doing imports until\n",
      "/Users/zu/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: DeprecationWarning: Call to deprecated `similarity` (Method will be removed in 4.0.0, use self.wv.similarity() instead).\n",
      "  after removing the cwd from sys.path.\n"
     ]
    }
   ],
   "source": [
    "# 计算两个单词的语义相似度\n",
    "print(model.similarity('魅族','4g'))\n",
    "print(model.similarity('16g','64g'))\n",
    "print(model.similarity('粉色','金色'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 网上中文语料库\n",
    "腾讯AI实验室宣布，正式开源一个大规模、高质量的中文词向量数据集  \n",
    "https://ai.tencent.com/ailab/nlp/embedding.html  \n",
    "120G+训练好的word2vec模型（中文词向量）  \n",
    "https://blog.csdn.net/tu_22/article/details/79035769  \n",
    "还有一些开源语料库，可以自己拿去训练。。。。。。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "[word,word]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "中国\n",
    "0.06030092,  0.13926093,  0.14931089,  0.09587928,  0.01172079,\n",
    "        0.16481434, -0.03030733, -0.00806136, -0.16576375,  0.07409613,\n",
    "       -0.14100142,  0.23502573, -0.07417766,  0.10642188,  0.015695  ,\n",
    "       -0.37878802,  0.07142749, -0.09509656,  0.36309183, -0.04938255,\n",
    "        0.26763403, -0.2751151 ,  0.07805856,  0.24847876, -0.02804651,\n",
    "        0.05709159,  0.18295015,  0.2644478 , -0.03482464,  0.06647653,\n",
    "       -0.08666561, -0.36904642, -0.10884262, -0.01698281, -0.29763272,\n",
    "       -0.03631932, -0.2168453 , -0.04120192,  0.12301973, -0.1818739 ,\n",
    "       -0.26965177,  0.04479553, -0.20034713, -0.06285373,  0.12583879,\n",
    "       -0.07162984,  0.03741832,  0.02025967, -0.14508504,  0.20582141,\n",
    "       -0.2006449 , -0.03148303,  0.36854672, -0.1676803 , -0.05470579,\n",
    "       -0.20834959, -0.4159349 , -0.15261981,  0.06619572, -0.0995653 ,\n",
    "       -0.17534412,  0.07112484,  0.0081148 ,  0.1560343 , -0.1809818 ,\n",
    "        0.07236055, -0.15453903, -0.1910816 ,  0.06993034, -0.22810018,\n",
    "        0.2847329 , -0.08550414,  0.11128701,  0.11653139, -0.11435539,\n",
    "        0.02566967,  0.08604472, -0.48752505, -0.4555698 ,  0.10849363,\n",
    "        0.15700744,  0.104773  , -0.03272085, -0.33915278,  0.21981613,\n",
    "        0.08956464,  0.26877728, -0.25333792, -0.05390652, -0.07425766,\n",
    "        0.32495135, -0.02435999,  0.29123324,  0.22463351,  0.19677334,\n",
    "       -0.03920394,  0.22039285,  0.30304262,  0.06015437, -0.04324128,\n",
    "        0.2037986 , -0.01792921, -0.3117134 , -0.0888361 ,  0.0633732 ,\n",
    "        0.03801356,  0.13587296, -0.01953664, -0.00784068,  0.08482304,\n",
    "       -0.28995574,  0.28048077, -0.07051264, -0.10778372, -0.21289313,\n",
    "       -0.22733615, -0.09544067, -0.2891292 , -0.0273876 , -0.09545469,\n",
    "        0.08318584, -0.21544272, -0.22302033,  0.09569433,  0.32523558,\n",
    "       -0.35689282, -0.19728719,  0.2617775\n",
    "天学网\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
